Abstract

A useful method for representing Bayesian classifiers is through discriminant
functions. Here, using copula functions, we propose a new model for discriminants.
This model provides a rich and generalized class of decision boundaries.
These decision boundaries significantly boost classification accuracy, especially
for high-dimensional feature spaces. We support our analysis with simulation
results.

1 Introduction 

Pattern classification is an important task in several image processing, statistical
learning, and data mining applications. The most popular pattern classifiers are Bayesian
classifiers. There are many well-known methods for representing Bayesian classifiers,
but one of the most useful is through discriminant functions. These functions provide
inter-class decision surfaces for Bayesian classifiers.

Discriminant functions assume several forms depending on the probability density
of the feature space. Most attention, however, has gone to discriminant functions
that assume a multivariate Gaussian distribution [1]. This attention is largely due
to the analytical tractability of the multivariate Gaussian distribution. A multivariate
Gaussian distribution necessarily has univariate Gaussian marginals. However, this
assumption is inappropriate in a large number of statistical learning problems.

In this paper, using copula functions, we build a rich and generalized class of
discriminant functions. These discriminant functions model the joint dependence structure
independently of the marginal distributions. We call them copula discriminant functions.
To the best of our knowledge, copula functions have not previously been used
for pattern recognition. However, they have recently been used in areas such as
computational finance and computational geology.

Our approach results in a 25% to 45% increase in classification accuracy. Similar 
to SVMs (Support Vector Machines), copula discriminants exhibit exemplary stability 
at higher dimensions. High dimensional feature spaces typically appear in tasks like
text classification, image processing, and natural language processing. 

We begin by introducing copula functions. Copula functions (also, copulas) are
an advanced statistical tool for modeling and estimating multivariate probability density
functions (pdfs). In a multivariate distribution estimation setting, copulas enable us to
separate the joint dependence structure from the marginal distributions. This elegant
separation also allows us to create arbitrary distribution functions with a known joint
dependence structure, without imposing restrictions on the marginal distributions.
We use this property to derive decision boundaries that are sensitive to the marginal
distributions.

The rest of the paper is organized as follows: in Section (2) we briefly introduce
Bayesian classifiers. Then in Section (3) we introduce copulas and use them to model
decision boundaries. Estimation of copula functions is discussed in Section (4). We
supplement our approach with simulation results in Section (5). Lastly, we draw
conclusions in Section (6).

2 Bayesian Decision with Multivariate Normal Discriminant Function

We briefly introduce the Bayesian decision technique for building Bayesian classifiers. 

Let $\{\omega_1, \ldots, \omega_c\}$ be a finite set of $c$ states of nature, or classes. Let the feature
space be $X \subseteq \mathbb{R}^d$. The problem of classification is to assign each $x \in X$ a label from
$\omega_1, \ldots, \omega_c$. Let $g_i(x)$, $i = 1, \ldots, c$, be discriminant functions. A classifier is said
to assign $x$ the label $\omega_i$ if

$$g_i(x) > g_j(x) \quad \text{for all } j \neq i. \tag{1}$$

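Also, it is worthwhile to note that, for any monotonically increasing function $Q(\cdot)$,
$Q(g_i(x))$ leaves the classification unaltered [1]. A well-known form of $g_i(x)$ is the
Bayes decision surface,

$$g_i(x) = P(\omega_i \mid x) = \frac{f(x \mid \omega_i) P(\omega_i)}{\sum_{j=1}^{c} f(x \mid \omega_j) P(\omega_j)},$$

or equivalently, after taking logarithms and dropping the denominator common to all classes,

$$g_i(x) = \ln f(x \mid \omega_i) + \ln P(\omega_i), \tag{2}$$

where $f(\cdot)$ is the probability density function, $P(\cdot)$ is the probability mass function,
and 'ln' denotes the natural logarithm. Classifiers obtained from (2) are known as Bayesian
classifiers. These classifiers achieve minimum-error-rate classification [1].

Evaluating (2) for $f(x \mid \omega_i) \sim N(\mu_i, \Sigma_i)$ gives us the normal discriminant function,

$$g_i(x) = -\tfrac{1}{2}(x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i) - \tfrac{d}{2}\ln(2\pi) - \tfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i). \tag{3}$$

The assumption that $f(x \mid \omega_i) \sim N(\mu_i, \Sigma_i)$ tacitly implies that $x_j \sim N(\mu_{ij}, (\sigma_{jj})_i)$.

Thus a normal discriminant cannot accurately classify samples whose marginal distributions
are not of the form implied by the joint dependence structure (for example, gamma marginals
with a Gaussian dependence structure). Recall that the Gaussian assumption persists mainly
because such problems otherwise lack analytical tractability.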
In the next section we will model discriminant functions using copulas. Copulas
allow us to tackle the problems mentioned earlier without losing analytical tractability.
Using copulas, joint distributions can be modeled while incorporating the idiosyncrasies of
the marginal distributions.
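
As an illustration of the normal discriminant (3) and the decision rule (1), the following minimal Python sketch handles the special case d = 2. The function names (`normal_discriminant_2d`, `classify`) and the restriction to two features are illustrative assumptions; the explicit 2 x 2 inverse is used only to keep the sketch free of external dependencies.

```python
import math


def normal_discriminant_2d(x, mean, cov, prior):
    """Evaluate g_i(x) of equation (3) for d = 2.

    x, mean : length-2 sequences
    cov     : 2x2 nested list, assumed symmetric positive definite
    prior   : class prior P(omega_i), assumed to lie in (0, 1]
    """
    (s00, s01), (s10, s11) = cov
    det = s00 * s11 - s01 * s10
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    dx0, dx1 = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mu)^t Sigma^{-1} (x - mu) via the explicit 2x2 inverse.
    quad = (s11 * dx0 * dx0 - (s01 + s10) * dx0 * dx1 + s00 * dx1 * dx1) / det
    # With d = 2 the term (d / 2) ln(2 pi) reduces to ln(2 pi).
    return -0.5 * quad - math.log(2.0 * math.pi) - 0.5 * math.log(det) + math.log(prior)


def classify(x, classes):
    """Return the index i maximizing g_i(x); classes is a list of (mean, cov, prior)."""
    scores = [normal_discriminant_2d(x, m, s, p) for (m, s, p) in classes]
    return max(range(len(scores)), key=scores.__getitem__)
```

For two classes with identity covariance, equal priors, and means (0, 0) and (3, 3), `classify` assigns a point near the origin to the first class, as the decision rule (1) requires.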

3 Copulas for Modeling Multivariate Probability Density Functions

We begin by rigorously defining copulas and their necessary components. Then, using
copula functions, we propose discriminant functions for Bayes classifiers that are
more flexible than those based on the multivariate normal density.

Definition 3.1 (Copula) A $d$-dimensional copula is a function $C : [0,1]^d \to [0,1]$
having the following properties:

1. Let $u \in [0,1]^d$, thus $u = (u_1, \ldots, u_d)$, where $u_j \in [0,1]$ $\forall j \in \{1, \ldots, d\}$;
then $C(u)$ is increasing in each component,

2. $C(u) = 0$ if at least one coordinate $u_j = 0$,

3. $C(u) = u_k$ if $u_j = 1$ for all $j \neq k$,

4. For every $a, b \in [0,1]^d$ with $a \leq b$, and a hypercube
$B = [a, b] = [a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_d, b_d]$
whose vertices lie in the domain of $C$, we have volume $V_C(B) \geq 0$.¹

One of the most pivotal theorems in copula theory is Sklar's theorem. It provides a 
relationship between multivariate distributions and their univariate counterparts. 

Theorem 3.1 (Sklar's theorem) Let $F$ be a $d$-dimensional distribution function with
margins $F_1, F_2, \ldots, F_d$. Then there exists a $d$-copula $C$ such that for all
$x_1, x_2, \ldots, x_d$ in $\overline{\mathbb{R}}$,

$$F(x_1, x_2, \ldots, x_d) = C(F_1(x_1), F_2(x_2), \ldots, F_d(x_d)). \tag{4}$$

If $F_1, F_2, \ldots, F_d$ are continuous, then $C$ is unique; otherwise, $C$ is uniquely determined
on $\mathrm{Ran}(F_1) \times \mathrm{Ran}(F_2) \times \cdots \times \mathrm{Ran}(F_d)$.² Conversely, if $C$ is a $d$-copula and
$F_1, F_2, \ldots, F_d$ are distribution functions, then the function $F$ defined by (4) is a
$d$-dimensional distribution function with margins $F_1, F_2, \ldots, F_d$.

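¹ $V_C(B)$ is defined as follows,

$$V_C(B) = \sum_{i_1=1}^{2} \cdots \sum_{i_d=1}^{2} (-1)^{i_1 + \cdots + i_d} \, C(u_{1,i_1}, \ldots, u_{d,i_d}),$$

for all $(u_{1,1}, \ldots, u_{d,1}) \in [0,1]^d$ and $(u_{1,2}, \ldots, u_{d,2}) \in [0,1]^d$ with $u_{n,1} \leq u_{n,2}$.
² $\mathrm{Ran}(F)$ indicates the range of the function $F(\cdot)$.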
Figure 1: (a) Contour plot of a two-class two-feature density function where the
marginals are gamma distributed but are jointly multivariate normal, (b) Gaussian copula
density for $\rho_{12} = 0.4$, (c) Contour plots of marginally Student's t and jointly Gaussian
random variables.



Proof: Refer lHa. 

Thus, by Sklar's theorem it is possible to decompose a multivariate distribution
function into a copula $C(\cdot)$ and marginal distributions $F_1, \ldots, F_d$. This provides a
bottom-up approach for modeling a multivariate distribution function. First, estimate
the marginal distributions and then use a copula (Equation (4)) to obtain a multivariate
distribution.
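
The converse part of Sklar's theorem can be illustrated with a short sketch. Assuming a bivariate Gaussian copula with correlation `rho` and exponential marginals parameterized by a rate (both choices, and all names below, are illustrative assumptions rather than the setup of our experiments), correlated standard normals are mapped to uniform variates with the normal distribution function and then to the target marginals with the inverse exponential distribution function.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula_exponential(n, rho, rate1, rate2, seed=0):
    """Draw n pairs whose dependence is a Gaussian copula with correlation rho
    and whose marginals are exponential with the given rates."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    eps = 1e-12  # keeps the uniforms strictly inside (0, 1)
    samples = []
    for _ in range(n):
        # Correlated standard normals (2x2 Cholesky factor written out by hand).
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # The normal cdf turns each coordinate into a uniform variate;
        # the pair (u1, u2) is a draw from the Gaussian copula.
        u1 = min(max(std_normal.cdf(z1), eps), 1.0 - eps)
        u2 = min(max(std_normal.cdf(z2), eps), 1.0 - eps)
        # Inverse exponential cdf imposes the marginals: F^{-1}(u) = -ln(1 - u) / rate.
        x1 = -math.log1p(-u1) / rate1
        x2 = -math.log1p(-u2) / rate2
        samples.append((x1, x2))
    return samples
```

Swapping the last step for another inverse distribution function changes the marginals without touching the dependence structure, which is the separation that (4) expresses.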

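Definition 3.2 (Copula Density) Copulas do not always have densities. However,
most of the copulas we study here possess copula densities. The copula density is given
as follows,

$$c(u_1, \ldots, u_d) = \frac{\partial^d C(u_1, \ldots, u_d)}{\partial u_1 \cdots \partial u_d}. \tag{5}$$

For an absolutely continuous joint distribution $F$ with strictly increasing and continuous
marginal dfs $F_1, \ldots, F_d$, we may differentiate $C(u_1, \ldots, u_d)$ to see that the
copula density is given by

$$c(u_1, \ldots, u_d) = \frac{f(F_1^{-1}(u_1), \ldots, F_d^{-1}(u_d))}{\prod_{k=1}^{d} f_k(F_k^{-1}(u_k))}, \tag{6}$$

so that, with $u_k = F_k(x_k)$,

$$c(F_1(x_1), \ldots, F_d(x_d)) = \frac{f(x_1, \ldots, x_d)}{\prod_{k=1}^{d} f_k(x_k)},$$

$$f(x_1, \ldots, x_d) = c(F_1(x_1), \ldots, F_d(x_d)) \prod_{k=1}^{d} f_k(x_k). \tag{7}$$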
The knowledge of copula density is particularly useful for estimating parameters of a
copula. 

Thus a probability density function can be written as a product of a copula density
and marginal density functions. A copula captures the dependence structure between
marginals without imposing restrictions on the marginal distributions. For example, we
can easily construct a probability density function wherein $c(\cdot)$ is a Gaussian copula
density (refer to Figure 1(a)), but $F_1, \ldots, F_d$ are non-Gaussian (beta, Poisson,
exponential, etc.).

Furthermore, by a process called the empirical marginal transform (Section 4), we
can transform individual features to obtain their respective empirical distribution
functions. These empirical distribution functions can replace the unknown distribution
functions. This allows us to further relax assumptions on the marginals and leads to a
significant increase in classification accuracy.

Now, let us turn our attention to Bayesian classifiers. In (2) we replace $f(x \mid \omega_i)$
with $f^i(x_1, \ldots, x_d)$. Thus from (7) we have,

$$f^i(x_1, \ldots, x_d) = c^i(F_1(x_1), \ldots, F_d(x_d)) \prod_{k=1}^{d} f_k(x_k; \theta_k^i), \tag{8}$$

where $\theta_k^i$ are parameters of the marginals. From (2) we simplify further,

$$g_i(x) = \ln\{c^i(F_1(x_1), \ldots, F_d(x_d))\} + \ln\Big\{\prod_{k=1}^{d} f_k(x_k; \theta_k^i)\Big\} + \ln P(\omega_i),$$

$$g_i(x) = \ln\{c^i(F_1(x_1), \ldots, F_d(x_d))\} + \sum_{k=1}^{d} \ln\{f_k(x_k; \theta_k^i)\} + \ln P(\omega_i), \tag{9}$$

if the prior probabilities $P(\omega_i)$ are the same for all the classes then,

$$g_i(x) = \ln\{c^i(F_1(x_1), \ldots, F_d(x_d))\} + \sum_{k=1}^{d} \ln\{f_k(x_k; \theta_k^i)\}. \tag{10}$$

Thus, as shown in (9) and (10), Bayesian classifiers can be derived using copula
densities. This gives us flexible and true decision boundaries in terms of discriminant
functions. As can be observed, these discriminant functions are more general
than their Gaussian counterparts. In Section (5) we further support
this generalization through simulation results.
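
The structure of the copula discriminant (9) can be sketched in a few lines of Python. The function below takes the log copula density and the marginal distribution and density functions as callables, so any copula family can be plugged in. All names (`copula_discriminant`, `exp_pdf`, `exp_cdf`, `independence_log_density`) are hypothetical helpers introduced for illustration, and the exponential marginals are an assumed example.

```python
import math


def copula_discriminant(x, log_copula_density, cdfs, pdfs, prior):
    """Evaluate g_i(x) of equation (9).

    x                  : sequence of d feature values
    log_copula_density : callable mapping (u_1, ..., u_d) to ln c^i(u)
    cdfs, pdfs         : sequences of d marginal cdfs F_k and pdfs f_k
    prior              : class prior P(omega_i)
    """
    # Marginal transform u_k = F_k(x_k).
    u = [F(xk) for F, xk in zip(cdfs, x)]
    # Sum of log marginal densities, the second term of (9).
    log_marginals = sum(math.log(f(xk)) for f, xk in zip(pdfs, x))
    return log_copula_density(u) + log_marginals + math.log(prior)


def exp_pdf(rate):
    """Density of an exponential distribution with the given rate."""
    return lambda t: rate * math.exp(-rate * t)


def exp_cdf(rate):
    """Distribution function of an exponential distribution with the given rate."""
    return lambda t: 1.0 - math.exp(-rate * t)


def independence_log_density(u):
    """Log density of the independence copula, c(u) = 1 everywhere."""
    return 0.0
```

With the independence copula the first term vanishes and the discriminant reduces to the sum of log marginal densities plus the log prior; a dependence structure enters only through `log_copula_density`.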

Copulas exist in various shapes and forms [3][10]. However, the most useful and popular
copulas are the parametric copulas, namely, the Gaussian and the Student's t copula
(also, t copula). We define these copulas as follows.

Definition 3.3 (Gaussian Copula) Let $\rho$ be a symmetric, positive definite matrix with
$\mathrm{diag}(\rho) = 1$ and let $\Phi_\rho$ be the standardized multivariate normal distribution with
correlation matrix $\rho$. Then the multivariate Gaussian copula is defined as,

$$C(u_1, \ldots, u_d; \rho) = \Phi_\rho(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_d)), \tag{11}$$
Figure 2: (a) Student's t copula density for $\rho_{12} = 0.4$, $\nu = 2$. (b) and (c) show contour
and surface plots respectively of marginally Gaussian and jointly Student's t random
variables.



where $\Phi^{-1}(u)$ denotes the inverse of the normal cumulative distribution function. The
density function of the Gaussian copula is given as,

$$c(u_1, u_2, \ldots, u_d; \rho) = |\rho|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2} \zeta' (\rho^{-1} - I) \zeta\right), \tag{12}$$

where $\zeta = (\Phi^{-1}(u_1), \Phi^{-1}(u_2), \ldots, \Phi^{-1}(u_d))'$. Figure 1(b) and Figure 1(c) depict
various forms of a bivariate Gaussian copula density.
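
For d = 2 the density (12) can be written out explicitly, since $|\rho| = 1 - r^2$ and $\rho^{-1} - I$ has entries $r^2/(1 - r^2)$ on the diagonal and $-r/(1 - r^2)$ off the diagonal, where $r = \rho_{12}$. The sketch below is a minimal illustration under that assumption; the function name is hypothetical and only the Python standard library is used.

```python
import math
from statistics import NormalDist


def gaussian_copula_density_2d(u1, u2, rho):
    """Bivariate Gaussian copula density of equation (12) with rho_12 = rho."""
    if not (0.0 < u1 < 1.0 and 0.0 < u2 < 1.0):
        raise ValueError("u1 and u2 must lie strictly inside (0, 1)")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly inside (-1, 1)")
    inv = NormalDist().inv_cdf
    z1, z2 = inv(u1), inv(u2)      # zeta = (Phi^{-1}(u1), Phi^{-1}(u2))'
    det = 1.0 - rho * rho          # |rho| for the 2x2 correlation matrix
    # zeta' (rho^{-1} - I) zeta, expanded for d = 2.
    quad = (rho * rho * (z1 * z1 + z2 * z2) - 2.0 * rho * z1 * z2) / det
    return math.exp(-0.5 * quad) / math.sqrt(det)
```

Setting `rho = 0` gives a density of one everywhere (the independence copula), which is a quick sanity check on the formula.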

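Definition 3.4 (Student's t copula) Let $\rho$ be a symmetric, positive definite matrix with
$\mathrm{diag}(\rho) = 1$ and let $T_{\rho,\nu}$ be the standardized multivariate Student's t distribution with
correlation matrix $\rho$ and $\nu$ degrees of freedom. Then the multivariate Student's t copula
is defined as follows,

$$C(u_1, \ldots, u_d; \rho, \nu) = T_{\rho,\nu}(t_\nu^{-1}(u_1), \ldots, t_\nu^{-1}(u_d)), \tag{13}$$

where $t_\nu^{-1}(u)$ denotes the inverse of the Student's t cumulative distribution function.
The associated Student's t copula density is given as,

$$c(u_1, u_2, \ldots, u_d; \rho, \nu) = |\rho|^{-\frac{1}{2}} \, \frac{\Gamma\left(\frac{\nu + d}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \left[\frac{\Gamma\left(\frac{\nu}{2}\right)}{\Gamma\left(\frac{\nu + 1}{2}\right)}\right]^{d} \frac{\left(1 + \frac{1}{\nu} \zeta' \rho^{-1} \zeta\right)^{-\frac{\nu + d}{2}}}{\prod_{k=1}^{d} \left(1 + \frac{\zeta_k^2}{\nu}\right)^{-\frac{\nu + 1}{2}}}, \tag{14}$$

where $\zeta = (t_\nu^{-1}(u_1), t_\nu^{-1}(u_2), \ldots, t_\nu^{-1}(u_d))'$. Figure 2 depicts various forms of a
bivariate Student's t copula density.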



4 Estimation of Copulas 

There are a few well-known methods for copula parameter estimation [2][6][9]. Most
of these methods involve direct maximization of the log-likelihood function of the copula
density with respect to the parameters.



4.1 Exact Maximum Likelihood (EML) Method 

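Let $\theta_c \in \Theta$ be the copula parameters, where $\Theta$ is the space of parameters. Let
$\{x^n\}_{n=1}^{N_i}$, $x^n \in \mathbb{R}^d$, represent $N_i$ training samples from the $i$-th class, where
$i = 1, \ldots, c$. Also, let $L_n(\theta_c)$ and $\ell_n(\theta_c)$ be the likelihood and the log-likelihood
functions of the underlying copula for the $n$-th sample. Then,

$$\ell(\theta_c) = \sum_{n=1}^{N_i} \ln c(F_1(x_1^n), \ldots, F_d(x_d^n); \theta_c) + \sum_{n=1}^{N_i} \sum_{k=1}^{d} \ln f_k(x_k^n). \tag{15}$$

Now, we maximize (15) with respect to $\theta_c$ as,

$$\hat{\theta}_c = \arg\max \{\ell(\theta_c) \mid \theta_c \in \Theta\}.$$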
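Thus for the Gaussian copula the log-likelihood function takes the form,

$$\ell(\theta_c) = -\frac{N_i}{2} \ln|\rho| - \frac{1}{2} \sum_{n=1}^{N_i} \zeta_n' (\rho^{-1} - I) \zeta_n, \tag{16}$$

where $\theta_c = \{\rho \mid \rho \in \mathbb{R}^{d \times d}\}$ and $\rho$ is a symmetric positive definite matrix. Thus, as can
be seen, the estimation of a Gaussian copula is simple and the parameter ($\rho$) also has
a closed-form solution,

$$\hat{\rho} = \frac{1}{N_i} \sum_{n=1}^{N_i} \zeta_n \zeta_n'. \tag{17}$$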
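
The closed-form estimator for the Gaussian copula parameter, the average of the outer products $\zeta_n \zeta_n'$, can be sketched as follows. The function name is a hypothetical choice, the inputs are assumed to be pseudo-samples lying strictly inside (0, 1), and plain nested lists are used to avoid external dependencies.

```python
from statistics import NormalDist


def gaussian_copula_mle(pseudo_samples):
    """Closed-form estimate rho_hat = (1 / N) * sum_n zeta_n zeta_n'.

    pseudo_samples : list of N sequences, each with d entries in (0, 1)
    returns        : d x d nested list
    """
    inv = NormalDist().inv_cdf
    # zeta_n = (Phi^{-1}(u_1^n), ..., Phi^{-1}(u_d^n))'
    zetas = [[inv(u) for u in row] for row in pseudo_samples]
    n, d = len(zetas), len(zetas[0])
    return [
        [sum(z[j] * z[k] for z in zetas) / n for k in range(d)]
        for j in range(d)
    ]
```

Note that this average is not guaranteed to have an exactly unit diagonal in finite samples; rescaling it to a correlation matrix is a common follow-up step that the sketch leaves out.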

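Similarly, we can write $\ell(\theta_c)$ for the Student's t copula as,

$$\ell(\theta_c) = N_i \ln\left(\frac{\Gamma\left(\frac{\nu + d}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\right) - N_i d \ln\left(\frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\right) - \frac{N_i}{2} \ln|\rho| - \frac{\nu + d}{2} \sum_{n=1}^{N_i} \ln\left(1 + \frac{\zeta_n' \rho^{-1} \zeta_n}{\nu}\right) + \frac{\nu + 1}{2} \sum_{n=1}^{N_i} \sum_{k=1}^{d} \ln\left(1 + \frac{\zeta_{kn}^2}{\nu}\right), \tag{18}$$

where $\theta_c = \{(\nu, \rho) \mid \nu \in (2, \infty], \rho \in \mathbb{R}^{d \times d}\}$. However, maximizing the log-likelihood
function with respect to the parameters is complicated, as it involves simultaneous
estimation of the parameters of the dependence structure ($\rho$) and the margins ($\nu$). A
more efficient method known as the canonical maximum likelihood (CML) method is
proposed by Mashal and Zeevi [9]. We discuss this method in Section (4.2).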
Dataset | Marginal probability density functions           | Gaussian copula discriminant (% Accuracy) | Normal discriminant (% Accuracy) | SVM, C ≤ 1000 (% Accuracy) | SVM, C ∈ 2^i, i ∈ {-5,...,15} (% Accuracy)
--------+--------------------------------------------------+-------------------------------------------+----------------------------------+----------------------------+-------------------------------------------
1       | t_2                                              | 99.10                                     | 72.47                            | 84.53                      | 98.50
2       | Gam(4,2)                                         | 80.78                                     | 22.94                            | 81.67                      | 93.87
3       | Exp(0.7)                                         | 76.62                                     | 40.54                            | 99.20                      | 99.80
4       | Gam(4.3,1.7), Log-N(0.64,0.22)                   | 81.86                                     | 35.06                            | 79.93                      | 92.73
5       | Exp(0.6), Gam(4,2)                               | 78.82                                     | 34.16                            | 69.67                      | 89.13
6       | Log-N(0.7,0.2), Gam(5,3), Exp(0.5)               | 79.86                                     | 39.68                            | 62.33                      | 90.53
7       | Exp(0.32), Gam(3.1,4.3), χ²(3.2)                 | 77.32                                     | 40.62                            | 61.73                      | 93.20
8       | Log-N(0.53,0.36), Gam(6.2,3.3), Exp(0.44), χ²(5) | 80.24                                     | 38.92                            | 65.53                      | 92.40

Table 1: Comparison of Gaussian copula discriminant and normal discriminant for 
various datasets. All datasets possess a 100-dimensional feature space. 



4.2 Canonical Maximum Likelihood (CML) Method 

The method described above makes assumptions about the distribution of the marginals
while it performs parameter estimation. The CML method, in contrast, does not make any
distributional assumption. It uses a transformation known as the empirical marginal
transformation to transform the data to an estimate of its empirical distribution. This
estimate of the empirical distribution approximates the unknown marginal distribution
$F_k(\cdot)$ as,

$$\hat{F}_k(\cdot) = \frac{1}{N_i} \sum_{n=1}^{N_i} \mathbf{1}_{\{x_{kn} \leq \cdot\}}, \tag{19}$$

where $\mathbf{1}_{\{x_{kn} \leq \cdot\}}$ is the indicator function. Thus the empirical marginal transformation
transforms the data to uniform variates.
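
The empirical marginal transformation can be sketched with one sort per feature followed by binary searches. The function names are hypothetical, and the rescaling by $N/(N+1)$ in `pseudo_samples` is an added assumption (a common convention, not stated in (19)) that keeps the transformed values strictly below one so that inverse distribution functions remain finite.

```python
from bisect import bisect_right


def empirical_cdf(sample):
    """Return F_hat(t) = (1 / N) * #{n : x_n <= t} for a one-dimensional sample."""
    ordered = sorted(sample)          # O(N log N), done once
    n = len(ordered)

    def f_hat(t):
        # bisect_right counts the observations that are <= t.
        return bisect_right(ordered, t) / n

    return f_hat


def pseudo_samples(column):
    """Map one feature column to values in (0, 1) using rank / (N + 1)."""
    ordered = sorted(column)
    n = len(ordered)
    return [bisect_right(ordered, v) / (n + 1) for v in column]
```

For the sample (3, 1, 2) the empirical distribution function evaluates to 2/3 at t = 2, and the rescaled pseudo-samples are 3/4, 1/4, and 1/2.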

Mashal and Zeevi [9] show the effective application of the CML method for calibrating
a Student's t copula. They propose a robust estimator for $\rho$ that uses rank
correlation, in particular Kendall's $\tau$. Using this estimator, the CML method can be
summarized as in Algorithm (4.1):

Algorithm 4.1 (CML method for Student's t copula)

1. Transform the data $x_n$ to pseudo-samples $u_n$ using the empirical marginal
transformation;

2. Estimate the correlation matrix $\hat{\rho}$ as in [9];

3. Perform the unconstrained maximization for $\nu$ as,

$$\hat{\nu} = \arg\max_{\nu \in (2, \infty]} \sum_{n=1}^{N_i} \log c(u_1^n, u_2^n, \ldots, u_d^n; \hat{\rho}, \nu). \tag{20}$$
Figure 3: Accuracy of Gaussian copula discriminant

The above method has computational advantages over other methods [9]. This method
does not require matrix inversion and therefore has the advantage of being numerically
stable in the presence of close-to-singular correlation matrices. The estimator for $\rho$ is
$O(n^2)$ for each coefficient. This is better than the iterative procedure for estimation.
Moreover, as the maximization is carried out with respect to a single parameter ($\nu$), it is
significantly faster and more stable.
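
The rank-correlation step of Algorithm 4.1 can be sketched as follows. The exact estimator is deferred to [9]; the sketch assumes the standard relation for elliptical copulas, $\hat{\rho}_{jk} = \sin(\pi \hat{\tau}_{jk} / 2)$, where $\hat{\tau}_{jk}$ is the sample Kendall's $\tau$ between features $j$ and $k$. The function names are hypothetical, and ties are counted as neither concordant nor discordant.

```python
import math


def kendall_tau(xs, ys):
    """Sample Kendall's tau from all n (n - 1) / 2 pairs (O(n^2) comparisons)."""
    n = len(xs)
    concordant = 0
    discordant = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def rho_from_kendall_tau(tau):
    """Map Kendall's tau to a correlation coefficient (assumed elliptical relation)."""
    return math.sin(0.5 * math.pi * tau)
```

The double loop makes the quadratic cost per coefficient explicit, and no matrix inversion is involved at any point.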

5 Simulation Results 

We simplify our problem to a two-category classification problem. Using synthetic
datasets we establish the propriety of our approach. Although a similar simulation
methodology is used for all datasets (see Table 1), due to space limitations we present
detailed results only for Dataset-1. The features of this dataset are individually
Student's t distributed but are jointly multivariate normal. This dataset consists of around
4000 samples. We use 70% of the samples for training and 30% of the samples for prediction.

We start by estimating marginal empirical distribution functions as given in (19).
Then we numerically differentiate the empirical distribution functions to get the
empirical density functions. It is worthwhile to note that we have not made any distributional
assumption on the marginal densities. Next, using the EML method we estimate the
parameters of the Gaussian copula density for all classes. Recall that these copula densities
lead to copula discriminants, which are then used to predict class labels as seen in Section (2).

For every dimension we evaluate both methods several times. The ensemble average
of these evaluations is reported in Figure (3). Clearly, the copula discriminant
function is about 25% more accurate at higher dimensions. This inability of the normal
discriminant to classify high-dimensional data accurately can be attributed to the Hughes
phenomenon [4][11]. Intuitively, as the dimension increases, the accuracy of the normal
discriminant should increase, since a new feature would only add more information
about the data. As a result, the Bayes error is a decreasing function of the dimensionality
of the data [11]. However, Figure (3) reveals that normal discriminant functions exhibit
a degradation of accuracy as the dimension increases. This degradation is caused by
a relative increase in classification error as compared to the decrease in Bayes error
[11]. This increase in classification error is due to the fact that more parameters need to
be estimated from the same number of training samples. If this increase in
classification error is greater than the decrease in Bayes error, the overall performance
degrades [11]. This is known as the Hughes phenomenon. Lee and Landgrebe [8] argue
that at high dimensions the second order statistics of a class contain more
information than the first order statistics. Copula discriminant
functions have a unique capability to exploit this higher order information (Figure 3).

In Table 1 we summarize simulation results for all datasets. Furthermore, we
compare the efficacy of our approach with SVMs. We have chosen SVMs since they
are kernel-based non-Bayesian classifiers. Also, they are known to be highly
accurate in a variety of classification tasks.

Each of these datasets has a 100-dimensional feature space and consists of around
4000 samples. Here, the individual feature distributions can assume various forms (either
gamma, $t_2$, lognormal, exponential, or $\chi^2$) as indicated by the second column of
Table 1.

For training and classification using SVMs we use the SVM$^{light}$ [5] package with
an RBF kernel. The search for optimal RBF kernel parameters is performed using a simple
search algorithm given in [7][12]. We restrict the SVM trade-off parameter to
$C \leq 1000 \times \{\max(\|r\|^2)\}$ for all feature vectors $r$, where $\max(\cdot)$ is the maximum value and
$\|\cdot\|$ indicates the Euclidean norm. However, it is also common to use $C = \{\mathrm{avg}(\|r\|^2)\}^{-1}$,
where $\mathrm{avg}(\cdot)$ indicates the arithmetic average.

We report the accuracy obtained from the above methods in Table 1. Again, the results
show that the Gaussian copula discriminant function is about 35% to 40% more
accurate than the normal discriminant function. The Gaussian copula discriminant
also exhibits an accuracy comparable to SVMs, although we have observed
better performance for SVMs if $\log_2 C \in \{-5, \ldots, 15\}$ [12]. The accuracy of SVMs for various
values of $C$ is presented in columns five and six of Table 1. Even so,
copula discriminants have several advantages over SVMs:

1. They obviate the need to solve a quadratic programming problem, as required by SVMs.
Also, to obtain the empirical marginal distribution functions we only have to
sort the training data ($O(n \log n)$).

2. The memory footprint of copulas is much smaller than that of SVMs. Only a small
subset of points representing the marginal densities needs to be stored.

3. Closed-form solutions for the parameters are available and produce significantly
better results. Thus a time-consuming search over the parameter space for optimal
parameters becomes unnecessary.
6 Conclusion



Copulas provide generalized decision boundaries for Bayesian classifiers. In our
simulations this generalization is effective and increases classification accuracy
significantly. Most importantly, this gain in accuracy is not achieved at the expense of
analytical tractability. Thus, compared to popular models, copulas provide a superior model
for pattern classification. However, the method for choosing a copula that best fits the joint
feature distribution remains an open research question. In our subsequent papers we shall
explore this issue along with other applications of copulas to statistical learning.